WRANGLERS
The World Happiness Report has proven to be an indispensable tool for policymakers
looking to better understand what makes people happy…
— Jeffrey Sachs
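library(tidyverse)  # dplyr, tidyr, purrr, ggplot2
library(readxl)     # read_xls()
library(ggiraph)    # girafe(), used for the plot further down
df <- read_xls('./archetypes/happiness-report/happiness-report-2020.xls')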
df
dim(df)
## [1] 1704 26
The output tells us that the data contains 1,704 rows and 26 columns.
glimpse(df)
## Rows: 1,704
## Columns: 26
## $ `Country name` <chr> "Afghanista~
## $ Year <dbl> 2008, 2009,~
## $ `Life Ladder` <dbl> 3.723590, 4~
## $ `Log GDP per capita` <dbl> 7.168690, 7~
## $ `Social support` <dbl> 0.4506623, ~
## $ `Healthy life expectancy at birth` <dbl> 50.80, 51.2~
## $ `Freedom to make life choices` <dbl> 0.7181143, ~
## $ Generosity <dbl> 0.177888572~
## $ `Perceptions of corruption` <dbl> 0.8816863, ~
## $ `Positive affect` <dbl> 0.5176372, ~
## $ `Negative affect` <dbl> 0.2581955, ~
## $ `Confidence in national government` <dbl> 0.6120721, ~
## $ `Democratic Quality` <dbl> -1.92968965~
## $ `Delivery Quality` <dbl> -1.6550844,~
## $ `Standard deviation of ladder by country-year` <dbl> 1.774662, 1~
## $ `Standard deviation/Mean of ladder by country-year` <dbl> 0.4765997, ~
## $ `GINI index (World Bank estimate)` <dbl> NA, NA, NA,~
## $ `GINI index (World Bank estimate), average 2000-16` <dbl> NA, NA, NA,~
## $ `gini of household income reported in Gallup, by wp5-year` <dbl> NA, 0.44190~
## $ `Most people can be trusted, Gallup` <dbl> NA, 0.28631~
## $ `Most people can be trusted, WVS round 1981-1984` <dbl> NA, NA, NA,~
## $ `Most people can be trusted, WVS round 1989-1993` <dbl> NA, NA, NA,~
## $ `Most people can be trusted, WVS round 1994-1998` <dbl> NA, NA, NA,~
## $ `Most people can be trusted, WVS round 1999-2004` <dbl> NA, NA, NA,~
## $ `Most people can be trusted, WVS round 2005-2009` <dbl> NA, NA, NA,~
## $ `Most people can be trusted, WVS round 2010-2014` <dbl> NA, NA, NA,~
The output above gives a little more information. For example, ‘Country name’ is a character column (shown as <chr>), and all other columns are double-precision numbers (shown as <dbl>). This already tells us that ‘Year’ does not have the right type: it labels a survey year rather than measuring a quantity, so it should be treated as a categorical variable and not as a number. We will fix it with the next command. Also note the first few values of each column, which help us get familiar with the data at hand. Some columns display many NA values, which indicate missing data.
df <- df %>% mutate(Year = as.factor(Year))
str(df$Year)
## Factor w/ 14 levels "2005","2006",..: 4 5 6 7 8 9 10 11 12 13 ...
Column ‘Year’ is now a factor, which is R's data type for categorical variables. The levels are the distinct values the factor can take; here there are 14 of them, starting with "2005".
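One caveat is worth keeping in mind once ‘Year’ is a factor: converting it straight back to a number returns the internal level codes, not the years (this is what the 4 5 6 ... in the str() output are). The sketch below uses a small made-up vector of years, not the report data, to show the difference and the usual way to recover the original values.

```r
# Hypothetical years, not taken from the report data
years  <- c(2008, 2009, 2005, 2008)
year_f <- as.factor(years)

levels(year_f)                     # distinct values, sorted: "2005" "2008" "2009"
as.integer(year_f)                 # internal level codes: 2 3 1 2
as.numeric(as.character(year_f))   # original years: 2008 2009 2005 2008
```

Going through as.character() first is the standard idiom whenever the numeric years are needed again, for example to plot a time series.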
glimpse(df)
## Rows: 1,704
## Columns: 26
## $ `Country name` <chr> "Afghanista~
## $ Year <fct> 2008, 2009,~
## $ `Life Ladder` <dbl> 3.723590, 4~
## $ `Log GDP per capita` <dbl> 7.168690, 7~
## $ `Social support` <dbl> 0.4506623, ~
## $ `Healthy life expectancy at birth` <dbl> 50.80, 51.2~
## $ `Freedom to make life choices` <dbl> 0.7181143, ~
## $ Generosity <dbl> 0.177888572~
## $ `Perceptions of corruption` <dbl> 0.8816863, ~
## $ `Positive affect` <dbl> 0.5176372, ~
## $ `Negative affect` <dbl> 0.2581955, ~
## $ `Confidence in national government` <dbl> 0.6120721, ~
## $ `Democratic Quality` <dbl> -1.92968965~
## $ `Delivery Quality` <dbl> -1.6550844,~
## $ `Standard deviation of ladder by country-year` <dbl> 1.774662, 1~
## $ `Standard deviation/Mean of ladder by country-year` <dbl> 0.4765997, ~
## $ `GINI index (World Bank estimate)` <dbl> NA, NA, NA,~
## $ `GINI index (World Bank estimate), average 2000-16` <dbl> NA, NA, NA,~
## $ `gini of household income reported in Gallup, by wp5-year` <dbl> NA, 0.44190~
## $ `Most people can be trusted, Gallup` <dbl> NA, 0.28631~
## $ `Most people can be trusted, WVS round 1981-1984` <dbl> NA, NA, NA,~
## $ `Most people can be trusted, WVS round 1989-1993` <dbl> NA, NA, NA,~
## $ `Most people can be trusted, WVS round 1994-1998` <dbl> NA, NA, NA,~
## $ `Most people can be trusted, WVS round 1999-2004` <dbl> NA, NA, NA,~
## $ `Most people can be trusted, WVS round 2005-2009` <dbl> NA, NA, NA,~
## $ `Most people can be trusted, WVS round 2010-2014` <dbl> NA, NA, NA,~
missing_stats <- purrr::map_df(df, ~ sum(is.na(.))) %>%
gather('Column name', 'Count of missing values')
missing_stats
We now know that the first three columns have no missing values, whereas ‘Log GDP per capita’ has 28.
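The same count can also be obtained in base R, which may be handy as a cross-check. This is only a sketch on a made-up data frame (the names toy, gdp and gini are hypothetical, not columns of the report):

```r
# Hypothetical toy data frame with a few missing values
toy <- data.frame(
  country = c("A", "B", "C"),
  gdp     = c(7.2, NA, 9.1),
  gini    = c(NA, NA, 0.4)
)

# is.na() returns a logical matrix; summing each column counts its TRUEs
missing_counts <- colSums(is.na(toy))
missing_counts

# Share of missing values per column, in percent
round(100 * missing_counts / nrow(toy), 1)
```

Expressing the counts as a share of the rows makes it easier to judge which columns are too sparse to be useful.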
Next, we list the distinct values of the two identifying columns. For ‘Country name’:
distinct_df <- distinct(df,`Country name`) %>% arrange(`Country name`)
distinct_df
For ‘Year’:
distinct_df <- distinct(df, Year) %>% arrange(Year)
distinct_df
df_1 <- table(df$Year)
df_2 <- as.data.frame(df_1) %>%
  dplyr::rename(Year = Var1, Freq_absolute = Freq) %>%
  mutate(Freq_relative = paste0(round(100 * Freq_absolute / sum(Freq_absolute), digits = 2), "%"))
df_2
For example, the year 2008 has 110 records, which is about 6.5% of the entire dataset (110 out of 1,704 rows).
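For readers who prefer base R, absolute and relative frequencies can be sketched with table() and prop.table(). The years below are made up for illustration; only the final line uses figures quoted in the text (110 records out of 1,704).

```r
# Hypothetical years, not the report data
years <- factor(c(2008, 2008, 2009, 2010))

counts <- table(years)                          # absolute frequencies
shares <- round(100 * prop.table(counts), 2)    # relative frequencies in percent

data.frame(
  Year          = names(counts),
  Freq_absolute = as.vector(counts),
  Freq_relative = paste0(as.vector(shares), "%")
)

# Share of the 2008 records quoted in the text
round(100 * 110 / 1704, 2)   # 6.46
```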
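# Keep only the numeric measures (drop 'Country name' and 'Year')
df1 <- df[, 3:ncol(df)]
nRows <- nrow(df1)
# Count positive, negative and zero values in one column, ignoring NAs
calcStats <- function(x) {
  temp <- na.omit(df1[[x]])
  pos <- sum(temp > 0)
  is_zero <- sum(temp == 0)  # `==` tests equality; `sum(temp = 0)` would always return 0
  neg <- sum(temp < 0)
  c("number of positives" = pos, "negatives" = neg, "zero" = is_zero)
}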
result <- as.data.frame(Map(calcStats, colnames(df1)))
result
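A detail that is easy to get wrong in a function like calcStats is the comparison operator used to count zeros. Inside a function call, a single = names an argument instead of testing equality, so the result is silently wrong and no error is raised. A minimal illustration on a made-up vector:

```r
x <- c(-1.5, 0, 2, 0, NA)   # hypothetical values
x <- x[!is.na(x)]           # drop missing values first

sum(x == 0)   # 2: counts the elements equal to zero
sum(x = 0)    # 0: `x = 0` is read as an argument named x with value 0

# Sign counts for the vector
c(positive = sum(x > 0), negative = sum(x < 0), zero = sum(x == 0))
```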
df_long <- df %>%
  pivot_longer(
    `Life Ladder`:`Most people can be trusted, WVS round 2010-2014`,
    names_to = "measure",
    values_to = "value"
  )
v1 <- ggplot(df_long, aes(x = value)) +
  geom_histogram(fill = "#79B8E5") +
  facet_wrap(~ measure, scales = "free") +
  theme(
    panel.grid = element_blank(),
    strip.background = element_blank(),
    panel.background = element_blank()
  )
girafe(
  ggobj = v1, width_svg = 16, height_svg = 9,
  options = list(opts_sizing(rescale = TRUE, width = 1.0))
)
As we will see later in the course, the two variables of interest are “Life Ladder” and “Log GDP per capita”. Just looking at the graph, we see that “Life Ladder” ranges from 0 to 8, with the majority of values between 4 and 6. “Log GDP per capita” ranges from about 5 to 12. Taking the logarithm to be the natural log, GDP per person in this dataset therefore ranges from roughly $150 (e^5 ≈ 148) to roughly $160k (e^12 ≈ 162,755).
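The conversion from the log scale back to dollars is a single call to exp(). The sketch below assumes the column holds natural logarithms, which is what the dollar figures in the text imply; the endpoints 5 and 12 are the approximate values read off the histogram.

```r
# Approximate range of `Log GDP per capita` read from the histogram
log_gdp_range <- c(low = 5, high = 12)

# Undo the natural log to get GDP per person in dollars
gdp_range <- exp(log_gdp_range)
round(gdp_range)   # low: 148, high: 162755
```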
@misc{helliwell_2019_world,
author = { Helliwell, John F. and Layard, Richard and Sachs, Jeffrey D. },
title = {World Happiness Report 2019},
url = {https://worldhappiness.report/ed/2019/},
urldate = {2021-05-18},
year = {2019},
organization = {Worldhappiness.report}
}